Clustering Optical Water Types
This post describes an attempt at reproducing a study of ocean color optical water types (OWT). In that study, remote sensing reflectance (Rrs) was used to classify regions of the ocean according to their apparent optical properties. The author used fuzzy c-means (FCM) clustering. However, K-means clustering yields results that are quite close to those of FCM, so that's what I'll use here.
import os
from scipy.io import loadmat as LM
import matplotlib.pyplot as pl
from matplotlib.colors import BoundaryNorm as BN
from mpl_toolkits.basemap import Basemap
import matplotlib.cm as cm
from matplotlib import rcParams
import seaborn as SB
import pandas as PD
import numpy as NP
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.preprocessing import StandardScaler
%matplotlib inline
# setting precision across the notebook to 3 decimals
%precision 3
Additional graphic settings
rcParams['axes.formatter.limits']=(-2,3)
rcParams['xtick.labelsize'] = 14
rcParams['ytick.labelsize'] = 14
rcParams['font.size'] = 16
# Specifying data file paths
maindir = '/home/madhatter106/DATA/OceanColor/OWT/'
fname = 'nomad_rrs4clustering.mat'
fpath = os.path.join(maindir,fname)
This is a MATLAB data file, so I'll use scipy's loadmat method to load a MATLAB namespace into a python dictionary.
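As a quick aside, `loadmat` returns a plain dict (including a few `__`-prefixed metadata keys), and every MATLAB array comes back as at least 2-D. A tiny round-trip sketch, with a made-up file and variable name:

```python
import os
import tempfile

import numpy as np
from scipy.io import savemat, loadmat

# write and re-read a minimal .mat file (hypothetical content)
tmp = os.path.join(tempfile.gettempdir(), 'demo_owt.mat')
savemat(tmp, {'wl': np.array([412., 443., 490.])})

d = loadmat(tmp)
data_keys = sorted(k for k in d if not k.startswith('__'))
print(data_keys)      # -> ['wl']
print(d['wl'].shape)  # a 1-D array comes back as a 2-D row vector: (1, 3)
```

This is why, below, `wavelength[0]` is needed to get at the actual wavelength values.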
matlabData = LM(fpath)
Taking a look at the dictionary keys...
matlabData.keys()
Dictionary keys are:
- wl: wavelength
- lat_nomad: latitude
- lon_nomad: longitude
- chl_nomad: chlorophyll a
- rrs_below: sub-surface remote sensing reflectance
I'll assign the data to separate containers.
wavelength = matlabData['wl']
lat = matlabData['lat_nomad']
lon = matlabData['lon_nomad']
rrsBelow = matlabData['rrs_below']
chl = matlabData['chl_nomad'] # I might use this as an additional feature in clustering
geoLbl = ['lat', 'lon']
wlLbl = list(wavelength[0])
prodLbl = ['chl']
labels = geoLbl + wlLbl + prodLbl
df = PD.DataFrame(data=NP.hstack((lat,lon,rrsBelow,chl)), columns=labels)
df.head()
df.describe()
For context, and since geolocation is provided, I'll map these data.
df.info()
One of the requirements of this exercise is to have data from as varied water bodies as possible. Let's take a quick look at where the samples come from.
fig,ax=pl.subplots(figsize=(12,10))
m = Basemap(projection='kav7',lon_0=0,ax=ax)
x,y=m(df.lon.values,df.lat.values)
m.drawmapboundary(fill_color='#99ffff')
m.fillcontinents(color='#cc9966',lake_color='#99ffff')
m.scatter(x,y,50,marker='o',color='k',alpha=0.5);
Finally, I want to get a sense of how each feature relates to the others. Seaborn's pairplot method is quite handy for this. From here on I only use the Rrs data, as that is what I will use for clustering.
SB.set(context='notebook',font_scale=3)
SB.pairplot(df.loc[:,wlLbl]);
The plot above shows high correlation among some bands, mostly those that are spectrally close to each other. I'll next standardize the data, then proceed with PCA.
sc = StandardScaler()
dataStd = sc.fit_transform(df.loc[:,wlLbl].values)
dfStd = PD.DataFrame(dataStd,columns=wlLbl)
Now to make sure that this had the desired effect (mean~0, stdev~1)
dfStd.describe()
def PlotPCARes(classifier, data, threshold=0.85):
    """Plot the PCA explained-variance summary and the first two PCs."""
    f, ax = pl.subplots(ncols=2, figsize=(12,6))
    n = data.shape[1]
    ax[0].bar(range(1, n+1), classifier.explained_variance_ratio_, align='center', color='skyblue')
    ax[0].step(range(1, n+1), NP.cumsum(classifier.explained_variance_ratio_), where='mid')
    ax[0].hlines(threshold, 0, n+2, linestyles='--', linewidth=2, label='feature selection cutoff')
    ax[0].set_xticks(NP.arange(1, n+1))
    ax[0].set_xticklabels(['PC%d' % i for i in range(1, n+1)], rotation=45)
    ax[0].set_xlim((0, 1+n))
    ax[0].set_ylim((0, 1))
    ax[0].set_title('PCA Summary')
    ax[0].legend(loc='best')
    ax[1].set_title('First Two PCs')
    ax[1].scatter(data[:, 0], data[:, 1], alpha=0.1, edgecolor='k', linewidth=2)
    ax[1].set_xlabel('PC1')
    ax[1].set_ylabel('PC2')
pca = PCA()
pcData = pca.fit_transform(dataStd)
Next, I print the explained variance for each principal component (PC). Note that sklearn has already sorted them in decreasing order.
print("explained variance, by PC")
print(pca.explained_variance_)
print('-' * 80)
print("relative explained variance by PC")
print(pca.explained_variance_ratio_)
I'm going to use a cumulative 90% relative explained variance threshold to determine the principal components that I will use for clustering. The first two PCs account for roughly 96%, so I'll use PC1 and PC2. Here's a graphical rendering of the PCs with the cutoff line, and a scatter plot of the first two components.
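The number of components needed to reach the cutoff can also be computed directly from the ratios, rather than read off the plot. A small sketch; the helper name and the example ratios are made up:

```python
import numpy as np

def n_components_for(threshold, explained_variance_ratio):
    """Smallest number of leading PCs whose cumulative
    explained-variance ratio reaches the threshold."""
    cum = np.cumsum(explained_variance_ratio)
    return int(np.searchsorted(cum, threshold)) + 1

# hypothetical ratios, loosely shaped like the printout above
ratios = [0.80, 0.16, 0.02, 0.01, 0.01]
print(n_components_for(0.9, ratios))  # -> 2
```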
SB.set(context='notebook',font_scale=1)
PlotPCARes(pca, pcData, threshold=0.9)
SB.set_style(rc=rcParams)
The data resulting from PCA is no longer standardized. I will therefore standardize the first two PCs before passing them to the KMeans algorithm.
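As a side note, PCA scores are already zero-mean (PCA centers the data before projecting), so this second standardization mainly rescales each PC's variance to one. A self-contained check on toy data (the data are made up):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(200, 4) * [5.0, 2.0, 1.0, 0.5]  # toy data, unequal column variances

scores = PCA().fit_transform(StandardScaler().fit_transform(X))
print(np.allclose(scores.mean(axis=0), 0.0))  # True: scores are centered already
print(scores.std(axis=0))                     # ...but their spreads still differ by PC
```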
sc2 = StandardScaler()
pcFirst2Std = sc2.fit_transform(pcData[:,:2])
Next is a somewhat long-winded bit of code that loops over a range of cluster numbers; each iteration produces a figure with a silhouette plot on the left and, on the right, the data color-coded by cluster membership along with the cluster centroids. I'll also store silhouette scores and within-cluster sums of squares. The former details how distinct any given point is from the closest cluster it is not a member of; the latter describes how compact each cluster is. Quite a bit of the code below was adapted from the scikit-learn silhouette analysis example.
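Before the full loop, here is a minimal, self-contained illustration of the two diagnostics on toy blobs; the data are made up, while `KMeans.inertia_` and `silhouette_score` are the actual scikit-learn API used below:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# two well-separated toy blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5.0])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.inertia_)                      # within-cluster sum of squares (compactness)
print(silhouette_score(X, km.labels_))  # mean silhouette in [-1, 1] (separation)
```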
wInClustSumSq = []
silScore = []
centroids = []
maxClust = 20
for iClust in range(2, maxClust+1):
    f, axs = pl.subplots(ncols=2, figsize=(12,6))
    # silhouette scores range from -1 to 1, but in this case they fall roughly within (0.25, 0.65)
    axs[0].set_ylim([0, len(pcFirst2Std) + (iClust + 1) * 10])  # leave room for gaps between silhouette blades
    km = KMeans(n_clusters=iClust, init='k-means++', n_init=10, max_iter=500, random_state=0)
    km.fit(pcFirst2Std)
    labels = km.labels_
    wInClustSumSq.append(km.inertia_)
    silScore.append(silhouette_score(pcFirst2Std, labels, metric='euclidean'))
    centroids.append(km.cluster_centers_)
    sample_silhouette_values = silhouette_samples(pcFirst2Std, labels)
    silhouette_avg = silhouette_score(pcFirst2Std, labels)
    y_lower = 10
    for i in range(iClust):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        # 'spectral' was removed from newer matplotlib; nipy_spectral is its replacement
        color = cm.nipy_spectral(float(i) / iClust)
        axs[0].fill_betweenx(NP.arange(y_lower, y_upper),
                             0, ith_cluster_silhouette_values,
                             facecolor=color, edgecolor=color, alpha=0.7)
        # Label the silhouette plots with their cluster numbers at the middle
        axs[0].text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for the next plot
        y_lower = y_upper + 10  # 10 for the 0 samples
    if iClust == 2:
        axs[0].set_title("The silhouette plot for the various clusters.")
        axs[1].set_title("The visualization of the clustered data.")
    if iClust == maxClust:
        axs[0].set_xlabel("The silhouette coefficient values")
        axs[1].set_xlabel("Feature space for the 1st feature")
    axs[0].set_ylabel("Cluster label")
    axs[0].axvline(x=silhouette_avg, color="red", linestyle="--")
    axs[0].set_yticks([])  # Clear the y-axis labels / ticks
    axs[0].set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    axs[0].set_xlim([-.25, .7])
    # 2nd plot showing the actual clusters formed
    colors = cm.nipy_spectral(labels.astype(float) / iClust)
    axs[1].scatter(pcFirst2Std[:, 0], pcFirst2Std[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                   c=colors)
    # Labeling the clusters
    centers = km.cluster_centers_
    # Draw white circles at cluster centers
    axs[1].scatter(centers[:, 0], centers[:, 1],
                   marker='o', c="white", alpha=1, s=200)
    for i, c in enumerate(centers):
        axs[1].scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50)
    axs[1].set_ylabel("Feature space for the 2nd feature")
    pl.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                 "with n_clusters = %d" % iClust),
                fontsize=14, fontweight='bold')
The plots above on the left hand side are an example of silhouette analysis. Silhouette coefficients indicate how tightly packed each cluster is. Additionally the average silhouette score is plotted (here as a red dashed line). The criteria for evaluating a particular clustering scheme include the following:
- all clusters reaching above the average silhouette score;
- clusters of relatively equal size;
- as few "possibly badly clustered" points (silhouette score < 0) as possible.
As can be seen in the figures above, real-world data tend to be messy and the application of these criteria can be a bit fuzzy. But as long as the thought process is plainly expressed, it remains debatable and revisable; no stress required.
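The third criterion can be quantified directly as the fraction of points with a negative silhouette coefficient. A small sketch on made-up toy data, using the same `silhouette_samples` call as the loop above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# two toy blobs, mostly separable
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(60, 2), rng.randn(60, 2) + 4.0])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)
badFrac = (s < 0).mean()  # share of "possibly badly clustered" points
print(badFrac)
```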
The criteria above suggest that a 3-cluster partition works best. We can try to confirm that using the elbow method, as well as plotting the average silhouette score; both as a function of cluster number.
f,axs = pl.subplots(ncols=2,figsize=(15,6))
axs[0].plot(range(2, maxClust+1), wInClustSumSq, marker='o')
axs[0].set_xlabel('cluster number', fontsize=14)
axs[0].set_ylabel('within cluster sum-of-squares', fontsize=14);
axs[1].plot(range(2, maxClust+1), silScore, marker='o')
axs[1].set_xlabel('cluster number',fontsize=14)
axs[1].set_ylabel('silhouette score', fontsize=14);
The plots above seem to confirm the earlier diagnosis: there's an obvious elbow at 3 clusters, which incidentally also has the highest silhouette score. One might argue that there is a secondary elbow around 5-6 clusters, and there is a secondary peak at 6 clusters in the silhouette score plot. Silhouette analysis for these partitions shows disproportionate clusters. This does not necessarily discard these alternatives, however, as some water types may be underrepresented depending on when/where they were sampled. I will therefore retain labels for the 3-, 5-, and 6-cluster partitions, to see the resulting spatial distributions on the world map above. Note that the 4-cluster arrangement is not retained because one of its clusters falls just below the corresponding average silhouette score. The next step is then to re-run KMeans thrice.
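The elbow reading can be made slightly less subjective with a crude second-difference heuristic; the helper and the WCSS values below are made up for illustration:

```python
import numpy as np

def elbow_k(wcss, k_min=2):
    """Return the k at the largest second difference of the
    within-cluster sum-of-squares curve (a crude elbow finder)."""
    d2 = np.diff(wcss, n=2)
    return k_min + int(np.argmax(d2)) + 1

# hypothetical WCSS curve for k = 2..7: a sharp drop, then a slow decline
wcss = [100.0, 40.0, 30.0, 25.0, 22.0, 20.0]
print(elbow_k(wcss))  # -> 3
```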
clusterNums = [3, 5, 6]
labelsDict = {}
for iclust in clusterNums:
    km = KMeans(n_clusters=iclust, init='k-means++', n_init=10, max_iter=500, random_state=0)
    km.fit(pcFirst2Std)
    labelsDict['%d-cluster' % iclust] = km.labels_
Now I merge these labels into the original pandas dataframe.
dflabels=PD.DataFrame(labelsDict,columns=labelsDict.keys())
dfnew=PD.concat((df,dflabels),axis=1)
dfnew.head()
Now we can plot :)
fig, axs = pl.subplots(nrows=3, figsize=(12,21))
for ax, clnum in zip(axs, clusterNums):
    cmJet = pl.get_cmap('jet', clnum)
    m = Basemap(projection='kav7', lon_0=0, ax=ax)
    x, y = m(dfnew.lon.values, dfnew.lat.values)
    m.drawmapboundary(fill_color='#99ffff')
    m.fillcontinents(color='#cc9966', lake_color='#99ffff')
    bounds = NP.arange(clnum+1)  # color band edges: 0, 1, ..., clnum
    norm = BN(boundaries=bounds, ncolors=cmJet.N)
    sc = m.scatter(x, y, 50, marker='o', c=dfnew['%d-cluster' % clnum],
                   cmap=cmJet, alpha=0.7, norm=norm)
    ax.set_title('%d Clusters' % clnum, fontsize=16)
    cb = pl.colorbar(sc, ax=ax)
    cb.set_ticks(bounds[:-1] + 0.5)  # center one tick within each color band
    cb.set_ticklabels(bounds[:-1])   # label with cluster numbers 0..clnum-1
The clustering appears somewhat consistent. In the case of the 3-cluster partition, cluster #0 is consistently observed in coastal waters; cluster #1 appears mostly in what one might consider oceanic waters; cluster #2 seems more often associated with boundary currents.
Similar, though more nuanced, observations can be made for the 5- and 6-cluster partitions. Any meaningful conclusions, however, would require more information about the specific sampling locations. One such piece of information is available: chlorophyll a. Next, I plot chlorophyll against cluster label for each of the selected partitions.
f, axs = pl.subplots(nrows=3, figsize=(8,12))
for ax, clnum in zip(axs, clusterNums):
    SB.boxplot(x='%d-cluster' % clnum, y='chl', data=dfnew, ax=ax, notch=True)
    ax.set_yscale('log')
    ax.set_ylim(1e-4, 1e2)  # set_ylim is a method call, not an attribute assignment
Here the notches in the boxplot indicate a 95% confidence interval around the median (horizontal line inside the box). If the notches between two boxes don't overlap, I'll assume they are significantly different. In the 3-cluster partition, there's a significant difference in chlorophyll between cluster #1 and the other two clusters. This distinction agrees with the spatial distribution of each label. In the 5- and 6-cluster schemes, while such distinctions do occur, many of the boxes are overlapping, suggesting that chlorophyll a concentration is insufficient to tease out the significance of each cluster. Here, further interpretation would benefit from additional information about the samples assigned to each cluster.
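For reference, the notch half-width follows the usual McGill et al. rule of thumb, median plus or minus 1.57 * IQR / sqrt(n). A small sketch; the helper name is mine and the input data are illustrative:

```python
import numpy as np

def notch_interval(x):
    """Approximate 95% CI around the median, as drawn by
    boxplot notches: median +/- 1.57 * IQR / sqrt(n)."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    half = 1.57 * iqr / np.sqrt(x.size)
    return med - half, med + half

lo, hi = notch_interval(np.arange(1, 101))
print(lo < 50.5 < hi)  # the median of 1..100 is 50.5
```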
In this post, I attempted to reproduce the results of a fuzzy-logic clustering study (Moore et al., 2001) using a simpler technique, namely k-means clustering. Moreover, I used more Rrs bands as initial input, and went through a data-reduction step by extracting the principal components, which became the input to the clustering step. By running post-clustering diagnostics I retained three partitions: 3, 5, and 6 clusters. Moore et al. (2001) claim a 6-cluster partition to be the optimal selection. A 6-cluster partition could very well be appropriate; however, given what I know of the samples used in this study, I find it difficult to justify anything other than a 3-cluster partition.
Furthermore, using apparent optical properties may not be the best way to characterize a water sample. Inherent optical properties that more directly reflect seawater properties may be more appropriate as features for categorizing water types.
Finally, this is obviously not the final word on this topic. The Jupyter notebook is an easily shareable and usable platform, making it easy for anyone to download the source notebook and play with it. Comments and suggestions for improvement are welcome!